Goto

Collaborating Authors

 final decision


How to Find Fantastic AI Papers: Self-Rankings as a Powerful Predictor of Scientific Impact Beyond Peer Review

Su, Buxin, Collina, Natalie, Wen, Garrett, Li, Didong, Cho, Kyunghyun, Fan, Jianqing, Zhao, Bingxin, Su, Weijie

arXiv.org Artificial Intelligence

Peer review in academic research aims not only to ensure factual correctness but also to identify work of high scientific potential that can shape future research directions. This task is especially critical in fast-moving fields such as artificial intelligence (AI), yet it has become increasingly difficult given the rapid growth of submissions. In this paper, we investigate an underexplored measure for identifying high-impact research: authors' own rankings of their multiple submissions to the same AI conference. Grounded in game-theoretic reasoning, we hypothesize that self-rankings are informative because authors possess unique understanding of their work's conceptual depth and long-term promise. To test this hypothesis, we conducted a large-scale experiment at a leading AI conference, where 1,342 researchers self-ranked their 2,592 submissions by perceived quality. Tracking outcomes over more than a year, we found that papers ranked highest by their authors received twice as many citations as their lowest-ranked counterparts; self-rankings were especially effective at identifying highly cited papers (those with over 150 citations). Moreover, we showed that self-rankings outperformed peer review scores in predicting future citation counts. Our results remained robust after accounting for confounders such as preprint posting time and self-citations. Together, these findings demonstrate that authors' self-rankings provide a reliable and valuable complement to peer review for identifying and elevating high-impact research in AI.


Who Has The Final Say? Conformity Dynamics in ChatGPT's Selections

Arlinghaus, Clarissa Sabrina, Kenneweg, Tristan, Hammer, Barbara, Maier, Günter W.

arXiv.org Artificial Intelligence

Large language models (LLMs) such as ChatGPT are increasingly integrated into high-stakes decision-making, yet little is known about their susceptibility to social influence. We conducted three preregistered conformity experiments with GPT-4o in a hiring context. In a baseline study, GPT consistently favored the same candidate (Profile C), reported moderate expertise (M = 3.01) and high certainty (M = 3.89), and rarely changed its choice. In Study 1 (GPT + 8), GPT faced unanimous opposition from eight simulated partners and almost always conformed (99.9%), reporting lower certainty and significantly elevated self-reported informational and normative conformity (p < .001). In Study 2 (GPT + 1), GPT interacted with a single partner and still conformed in 40.2% of disagreement trials, reporting less certainty and more normative conformity. Across studies, results demonstrate that GPT does not act as an independent observer but adapts to perceived social consensus. These findings highlight risks of treating LLMs as neutral decision aids and underline the need to elicit AI judgments prior to exposing them to human opinions.


E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing

Jeong, Cheonsu, Sim, Seongmin, Cho, Hyoyoung, Kim, Sungsu, Shin, Byounggwan

arXiv.org Artificial Intelligence

This paper presents an intelligent work automation approach in the context of contemporary digital transformation by integrating generative AI and Intelligent Document Processing (IDP) technologies with an Automation Agent to realize End-to-End (E2E) automation of corporate financial expense processing tasks. While traditional Robotic Process Automation (RPA) has proven effective for repetitive, rule-based simple task automation, it faces limitations in handling unstructured data, exception management, and complex decision-making. This study designs and implements a four-stage integrated process comprising automatic recognition of supporting documents such as receipts via OCR/IDP, item classification based on a policy-driven database, intelligent exception handling supported by generative AI (large language models, LLMs), and human-in-the-loop final decision-making with continuous system learning through an Automation Agent. Applied to a major Korean enterprise (Company S), the system demonstrated quantitative benefits including over 80% reduction in processing time for paper receipt expense tasks, decreased error rates, and improved compliance, as well as qualitative benefits such as enhanced accuracy and consistency, increased employee satisfaction, and data-driven decision support. Furthermore, the system embodies a virtuous cycle by learning from human judgments to progressively improve automatic exception handling capabilities. Empirically, this research confirms that the organic integration of generative AI, IDP, and Automation Agents effectively overcomes the limitations of conventional automation and enables E2E automation of complex corporate processes. The study also discusses potential extensions to other domains such as accounting, human resources, and procurement, and proposes future directions for AI-driven hyper-automation development.


From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning

Huang, Yuzhen, Zeng, Weihao, Zeng, Xingshan, Zhu, Qi, He, Junxian

arXiv.org Artificial Intelligence

Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely affects RL training performance and becomes more pronounced as the policy model gets stronger. Subsequently, we investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL results imply that they are highly susceptible to hacking, where they misclassify certain patterns in responses as correct, particularly after fine-tuning. This vulnerability is exploited during policy model optimization, leading to artificially inflated rewards. Our findings underscore the unique challenges inherent to both rule-based and model-based verifiers and provide insights toward developing more accurate and robust reward systems for reinforcement learning.


Accountability Framework for Healthcare AI Systems: Towards Joint Accountability in Decision Making

Bagave, Prachi, Westberg, Marcus, Janssen, Marijn, Ding, Aaron Yi

arXiv.org Artificial Intelligence

AI is transforming the healthcare domain and is increasingly helping practitioners to make health-related decisions. Therefore, accountability becomes a crucial concern for critical AI-driven decisions. Although regulatory bodies, such as the EU commission, provide guidelines, they are highlevel and focus on the ''what'' that should be done and less on the ''how'', creating a knowledge gap for actors. Through an extensive analysis, we found that the term accountability is perceived and dealt with in many different ways, depending on the actor's expertise and domain of work. With increasing concerns about AI accountability issues and the ambiguity around this term, this paper bridges the gap between the ''what'' and ''how'' of AI accountability, specifically for AI systems in healthcare. We do this by analysing the concept of accountability, formulating an accountability framework, and providing a three-tier structure for handling various accountability mechanisms. Our accountability framework positions the regulations of healthcare AI systems and the mechanisms adopted by the actors under a consistent accountability regime. Moreover, the three-tier structure guides the actors of the healthcare AI system to categorise the mechanisms based on their conduct. Through our framework, we advocate that decision-making in healthcare AI holds shared dependencies, where accountability should be dealt with jointly and should foster collaborations. We highlight the role of explainability in instigating communication and information sharing between the actors to further facilitate the collaborative process.


Multiple LLM Agents Debate for Equitable Cultural Alignment

Ki, Dayeon, Rudinger, Rachel, Zhou, Tianyi, Carpuat, Marine

arXiv.org Artificial Intelligence

Large Language Models (LLMs) need to adapt their predictions to diverse cultural contexts to benefit diverse communities across the world. While previous efforts have focused on single-LLM, single-turn approaches, we propose to exploit the complementary strengths of multiple LLMs to promote cultural adaptability. We introduce a Multi-Agent Debate framework, where two LLM-based agents debate over a cultural scenario and collaboratively reach a final decision. We propose two variants: one where either LLM agents exclusively debate and another where they dynamically choose between self-reflection and debate during their turns. We evaluate these approaches on 7 open-weight LLMs (and 21 LLM combinations) using the NormAd-ETI benchmark for social etiquette norms in 75 countries. Experiments show that debate improves both overall accuracy and cultural group parity over single-LLM baselines. Notably, multi-agent debate enables relatively small LLMs (7-9B) to achieve accuracies comparable to that of a much larger model (27B parameters).


DecisionFlow: Advancing Large Language Model as Principled Decision Maker

Chen, Xiusi, Wang, Shanyong, Qian, Cheng, Wang, Hongru, Han, Peixuan, Ji, Heng

arXiv.org Artificial Intelligence

In high-stakes domains such as healthcare and finance, effective decision-making demands not just accurate outcomes but transparent and explainable reasoning. However, current language models often lack the structured deliberation needed for such tasks, instead generating decisions and justifications in a disconnected, post-hoc manner. To address this, we propose DecisionFlow, a novel decision modeling framework that guides models to reason over structured representations of actions, attributes, and constraints. Rather than predicting answers directly from prompts, DecisionFlow builds a semantically grounded decision space and infers a latent utility function to evaluate trade-offs in a transparent, utility-driven manner. This process produces decisions tightly coupled with interpretable rationales reflecting the model's reasoning. Empirical results on two high-stakes benchmarks show that DecisionFlow not only achieves up to 30% accuracy gains over strong prompting baselines but also enhances alignment in outcomes. Our work is a critical step toward integrating symbolic reasoning with LLMs, enabling more accountable, explainable, and reliable LLM decision support systems. Code and data are at https://github.com/xiusic/DecisionFlow.


ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions

Cho, Gyeongje, So, Yeonkyoung, Lee, Jaejin

arXiv.org Artificial Intelligence

Multiple-choice benchmarks, consisting of various prompts and choices, are among the most widely used methods to assess a language model's natural language understanding capability. Given a specific prompt, we typically compute $P(Choice|Prompt)$ to evaluate how likely a language model is to generate the correct choice compared to incorrect ones. However, we observe that performance measured using this approach reflects not only the model's comprehension of the prompt but also its inherent biases for certain choices regardless of the prompt. This issue makes it challenging to accurately measure a model's natural language understanding, as models may select the answer without fully understanding the prompt. To address this limitation, we propose a novel metric called ANPMI, which normalizes Pointwise Mutual Information (PMI) by $-\log P(Choice)$. ANPMI provides a more accurate assessment of the model's natural language understanding by ensuring that it is challenging to answer a question without properly understanding the prompt.


Automatically Evaluating the Paper Reviewing Capability of Large Language Models

Shin, Hyungyu, Tang, Jingyu, Lee, Yoonjoo, Kim, Nayoung, Lim, Hyunseung, Cho, Ji Yong, Hong, Hwajung, Lee, Moontae, Kim, Juho

arXiv.org Artificial Intelligence

Peer review is essential for scientific progress, but it faces challenges such as reviewer shortages and growing workloads. Although Large Language Models (LLMs) show potential for providing assistance, research has reported significant limitations in the reviews they generate. While the insights are valuable, conducting the analysis is challenging due to the considerable time and effort required, especially given the rapid pace of LLM developments. To address the challenge, we developed an automatic evaluation pipeline to assess the LLMs' paper review capability by comparing them with expert-generated reviews. By constructing a dataset consisting of 676 OpenReview papers, we examined the agreement between LLMs and experts in their strength and weakness identifications. The results showed that LLMs lack balanced perspectives, significantly overlook novelty assessment when criticizing, and produce poor acceptance decisions. Our automated pipeline enables a scalable evaluation of LLMs' paper review capability over time.


As Confidence Aligns: Exploring the Effect of AI Confidence on Human Self-confidence in Human-AI Decision Making

Li, Jingshu, Yang, Yitian, Liao, Q. Vera, Zhang, Junti, Lee, Yi-Chieh

arXiv.org Artificial Intelligence

Complementary collaboration between humans and AI is essential for human-AI decision making. One feasible approach to achieving it involves accounting for the calibrated confidence levels of both AI and users. However, this process would likely be made more difficult by the fact that AI confidence may influence users' self-confidence and its calibration. To explore these dynamics, we conducted a randomized behavioral experiment. Our results indicate that in human-AI decision-making, users' self-confidence aligns with AI confidence and such alignment can persist even after AI ceases to be involved. This alignment then affects users' self-confidence calibration. We also found the presence of real-time correctness feedback of decisions reduced the degree of alignment. These findings suggest that users' self-confidence is not independent of AI confidence, which practitioners aiming to achieve better human-AI collaboration need to be aware of. We call for research focusing on the alignment of human cognition and behavior with AI.